Abstract

In the past few years, the storage and analysis of large-scale and fast evolving networks
have presented a great challenge. Therefore, a number of different techniques have been
proposed for sampling large networks. In general, network exploration techniques approximate
the original networks more accurately than random node and link selection. Yet, link
selection with an additional subgraph induction step outperforms most other techniques. In
this paper, we also apply subgraph induction to random walk and forest-fire sampling.
We analyze different real-world networks and the changes of their properties introduced
by sampling. We compare several sampling techniques based on the match between the
original networks and their sampled variants. The results reveal that the techniques
with subgraph induction underestimate the degree and clustering distributions, while
they overestimate the average degree and density of the original networks. Techniques
without a subgraph induction step exhibit exactly the opposite behavior. Hence, the
performance of the sampling techniques from the random selection category does not differ
significantly from that of network exploration sampling, while clear differences exist
between the techniques with a subgraph induction step and the ones without it.

Keywords: complex networks, network sampling, comparison of sampling techniques, 
subgraph induction, sampling accuracy 


1. Introduction 

Real-world networks are often very large and fast evolving. Therefore, not only does
their storage pose a problem, but their analysis and understanding also present a great
challenge. In the past few years, a number of different techniques have been proposed for
sampling large networks to allow for their faster and more efficient analysis. However,
some information about the network is lost through sampling, so it is of key importance
to understand the changes in the network structure introduced by sampling.

Several studies on network sampling analyze the match between the original networks
and their sampled variants DllE], while only a few of them focus on comparing
the performance of different sampling techniques [HEIE]. In general, network exploration
techniques like random walk and forest-fire sampling approximate the original networks
more accurately than random node and link selection [3]. However, Ahmed et al. [7]
proposed link selection with an additional subgraph induction step, where the sampled
network consists of randomly selected links (i.e., random link selection) and any additional
links between their endpoints (i.e., subgraph induction). In this way, not only is the
performance of random link selection improved, but the proposed technique also outperforms
several other sampling techniques.

In this paper, we also apply subgraph induction to random walk and forest-fire sampling.
We consider ten real-world networks and analyze the changes of their properties
introduced by sampling. We compare eight sampling techniques based on the match
between the original and sampled networks. The results reveal that techniques with a
subgraph induction step tend to underestimate the degree and clustering distributions, while
other techniques tend to overestimate both properties. Moreover, the techniques with
subgraph induction overestimate the average degree and density of the original networks;
in contrast, techniques without subgraph induction underestimate them. It appears that
the performance of the sampling techniques from the random selection category does not
differ significantly from that of network exploration sampling, while clear differences exist
between the techniques with a subgraph induction step and the ones without subgraph
induction.

The rest of the paper is structured as follows. In Section 2, we first present the
background on network sampling and describe the sampling techniques used in the study.
The results of the empirical analysis are reported and formally discussed in Section 3,
while Section 4 concludes the paper and suggests directions for further research.

2. Network sampling 

Let the network be represented by a simple undirected graph $G = (V, E)$, where $V$
denotes the set of nodes ($n = |V|$) and $E$ is the set of links ($m = |E|$). The goal of
network sampling is to create a sampled network $G' = (V', E')$, where $V' \subseteq V$,
$E' \subseteq E$ and $n' = |V'| \ll n$, $m' = |E'| \ll m$. The sample $G'$ is obtained in
two steps. In the first step, nodes or links are sampled using a particular strategy such as
random selection or network exploration sampling. In the second step, the sampled nodes
and links are retrieved from the original network. The sampled network is called a subgraph
of the original network if it consists of sampled nodes or sampled links only. Otherwise,
if sampled nodes and all their mutual links are included in the sample, or the sample
consists of sampled links and any additional links among their endpoints, the sampled
network is called an induced subgraph of the original network (i.e., subgraph induction).
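
The difference between a plain subgraph and an induced subgraph can be illustrated with
a minimal Python sketch. The function names and the toy network below are hypothetical
and serve only as an illustration; links are assumed to be stored as node pairs.

```python
def induced_subgraph(edges, sampled_nodes):
    """Return all links of the original network with both endpoints sampled."""
    nodes = set(sampled_nodes)
    return {(u, v) for u, v in edges if u in nodes and v in nodes}


def endpoints(sampled_edges):
    """Collect the nodes touched by a set of sampled links."""
    return {node for edge in sampled_edges for node in edge}


# Toy network (an assumption): a triangle 1-2-3 with a pendant node 4.
edges = {(1, 2), (2, 3), (1, 3), (3, 4)}

# Plain subgraph: only the sampled links themselves are kept.
sampled_links = {(1, 2), (2, 3)}

# Subgraph induction: the link (1, 3) is added, since both endpoints are sampled.
induced = induced_subgraph(edges, endpoints(sampled_links))
```

In this toy case, induction adds one link to the two sampled links, which already hints
at why induced samples tend to be denser than samples consisting of sampled links only.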

Previous studies have shown that the size $n'$ of the sample affects the
performance of network sampling. As expected, larger samples approximate the original
networks more accurately [5]. Still, a sample size of 10-30% of the original network
achieves a balance between a small sample and an accurate approximation laiHiE].

The changes of network properties introduced by sampling depend also on the adopted
sampling technique, since different techniques are suitable for matching different sets of
network properties. Sampling techniques can be roughly divided into two categories:
random selection and network exploration techniques. In the first category,
nodes or links are included in the sample uniformly at random or proportional to some
particular characteristic like degree. In the second category, the sample is constructed
by retrieving a neighborhood of a randomly selected seed node using different strategies
like breadth-first search, random walk and forest-fire.

[Panels: (a) RNS, (b) RND, (c) RLS, (d) RLI]

Figure 1: Random selection techniques applied to a small toy network. Highlighted nodes and links
represent the samples obtained by different techniques. (a) In random node selection, the sample consists
of nodes selected uniformly at random and all their mutual links. (b) In random node selection by degree,
the nodes are selected to the sample with probability proportional to their degree, while all their mutual
links are included in the sample. (c) In random link selection, links are selected to the sample uniformly
at random. (d) In random link selection with subgraph induction, the sample consists of randomly
selected links (solid lines) and also any additional links between their endpoints (dashed lines).


2.1. Random selection 

For the purpose of this study, we consider four techniques from the random selection
category. We first adopt random node selection [4] (RNS), where the sample consists of
nodes selected uniformly at random and all their mutual links (Fig. 1(a)). RNS accurately
approximates the degree mixing and preserves the relationship of transitivity
and density between the original and sampled networks [5]. Moreover, it performs
better on larger samples than on smaller ones [^. Yet, RNS overestimates the degree
and betweenness centrality exponent and fails to match the clustering coefficient [^,
degree distribution and the average path length m of the original network.

Furthermore, we adopt random node selection by degree [3] (RND), which improves
the performance of RNS. Here, the nodes are selected randomly with probability proportional
to their degrees and all their mutual links are included in the sample (Fig. 1(b)).
RND matches in-degree and out-degree distributions and also spectral properties of the
original network better than RNS [4]. Besides, it constructs samples with a larger weakly
connected component [5]. Nevertheless, even if the original network is connected, both
RNS and RND can construct a disconnected sampled network.
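
As a rough illustration of RNS and RND, the following Python sketch selects nodes either
uniformly or proportionally to degree and then keeps all their mutual links. It is a
sketch under stated assumptions, not the implementation used in the study: the function
names are hypothetical, the network is given as a list of links, and degree-proportional
selection without repetition is approximated by ignoring repeated draws.

```python
import random


def node_degrees(edges):
    """Count the degree of every node that appears in the link list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def random_node_selection(edges, k, by_degree=False, rng=random):
    """Pick k distinct nodes (k must not exceed the number of nodes),
    then keep all links among them (subgraph induction)."""
    deg = node_degrees(edges)
    candidates = sorted(deg)
    if by_degree:
        # RND: draw proportionally to degree and ignore repeated draws
        # (one simple way to avoid duplicates; an assumption of this sketch).
        weights = [deg[v] for v in candidates]
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choices(candidates, weights=weights)[0])
    else:
        # RNS: every node is equally likely.
        chosen = set(rng.sample(candidates, k))
    links = {(u, v) for u, v in edges if u in chosen and v in chosen}
    return chosen, links


# Hypothetical toy network: a triangle 1-2-3 with a tail 3-4-5.
toy = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
nodes, links = random_node_selection(toy, 3, by_degree=True, rng=random.Random(0))
```

Nothing in either variant forces the chosen nodes to be adjacent, so the resulting sample
may be disconnected even when the toy network is connected.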

Next, we adopt random link selection [3] (RLS), where the sample consists of links
selected uniformly at random (Fig. 1(c)). RLS matches well the degree mixing [5] and the
distribution of sizes of weakly connected components [4]. It constructs sparse samples
and accurately approximates the average path length of the original network m. Yet,
RLS fails to match most other network properties [3]. Like RNS, RLS overestimates
the degree and betweenness centrality exponent and underestimates the clustering
coefficient [S].

Lastly, we adopt random link selection with subgraph induction [7] (RLI), which
improves the performance of RLS. Here, the sample consists of links selected uniformly at
random and any additional links between their endpoints (Fig. 1(d)). RLI outperforms
several other techniques in matching the degree, path length and clustering coefficient
distribution of the original networks [7]. It is more likely to select nodes with higher degree
than other random selection techniques, which increases the connectivity of the sample.
Moreover, RLI is suitable for sampling large networks that cannot fit into main memory
and can also be implemented as a technique for sampling streaming networks m.

[Panels: (a) RWS, (b) RWI, (c) FFS, (d) FFI]

Figure 2: Network exploration techniques applied to a small toy network. Highlighted nodes and links
represent the samples obtained by different techniques. (a) In random walk sampling, the sample consists
of links, retrieved from a simulation of a random walker on the network starting at a randomly selected
seed node. (b) In random walk sampling with subgraph induction, the sample consists of links selected
with random walk sampling (solid lines) and also any additional links between their endpoints (dashed
lines). (c) In forest-fire sampling, the broad neighborhood of a randomly selected seed node is retrieved
using partial breadth-first search, where only a fraction of links is included in the sample on each step.
(d) In forest-fire sampling with subgraph induction, the sample consists of links selected with forest-fire
sampling (solid lines) and also any additional links between their endpoints (dashed lines).
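
A minimal Python sketch of random link selection, with and without the subgraph induction
step, is given below. The function name and the toy network are hypothetical, and the
in-memory link list is an assumption of the sketch (a streaming variant would look
different).

```python
import random


def random_link_selection(edges, k, induce=False, rng=random):
    """Sample k links uniformly at random; optionally add every other link
    of the original network whose endpoints were both sampled."""
    edge_list = sorted(edges)
    sampled = set(rng.sample(edge_list, k))
    if not induce:
        return sampled  # RLS: sampled links only
    nodes = {node for edge in sampled for node in edge}
    # RLI: sampled links plus all additional links among their endpoints.
    return {(u, v) for u, v in edge_list if u in nodes and v in nodes}


# Hypothetical toy network: a triangle 1-2-3 with a tail 3-4-5.
toy = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
rls = random_link_selection(toy, 2, rng=random.Random(0))
rli = random_link_selection(toy, 2, induce=True, rng=random.Random(0))
```

With the same random seed, the RLI sample always contains the RLS sample, so induction
can only add links to the sampled ones.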


2.2. Network exploration 

We consider four sampling techniques from the network exploration category (note
that in the literature, this category of sampling techniques is also called topology based
sampling [7], traversal based sampling [12] or link-trace sampling HSl). First, we adopt
random walk sampling [3] (RWS), where a random walk is simulated on the network,
starting at a randomly selected seed node (Fig. 2(a)). The sample consists of the links
visited by the random walker and represents a connected subgraph of the original network.
RWS outperforms random selection techniques in matching the transitivity iia,
clustering coefficient distribution and spectral properties and also shows good performance
on smaller samples [4]. Yet, RWS is biased towards selecting nodes with high
degree m and fails to match the degree distribution [ID.
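
Random walk sampling can be sketched in a few lines of Python. This is an illustrative
sketch only: the function names, the step limit max_steps and the toy network are
assumptions, and the network is assumed to have no isolated nodes.

```python
import random


def adjacency(edges):
    """Build neighbor lists of an undirected network from its link list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk_sampling(edges, n_links, rng=random, max_steps=100_000):
    """Walk from a random seed node and collect the traversed links
    until n_links distinct links are sampled (or max_steps is reached)."""
    adj = adjacency(edges)
    node = rng.choice(sorted(adj))  # randomly selected seed node
    sampled = set()
    for _ in range(max_steps):
        if len(sampled) >= n_links:
            break
        nxt = rng.choice(adj[node])  # move to a uniformly chosen neighbor
        sampled.add((min(node, nxt), max(node, nxt)))
        node = nxt
    return sampled


# Hypothetical toy network with links stored as (smaller, larger) node pairs.
toy = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
sample = random_walk_sampling(toy, 3, rng=random.Random(0))
```

Since the walker moves only along existing links, the sampled links form a connected
subgraph, and nodes with many neighbors are visited more often.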

Next, we adopt forest-fire sampling [3] (FFS). Here, a broad neighborhood of a randomly
selected seed node is retrieved by partial breadth-first search (Fig. 2(c)). The
number of links sampled on each step is selected from a geometric distribution with mean
$p/(1 - p)$, where $p$ is set to 0.7 [4]. Thus, on average 2.33 links are included in the sample
on each step. FFS matches well spectral properties [3] and together with RWS shows
the best overall performance among several techniques [3]. However, FFS fails to match
the path length and clustering coefficient of the original networks m.
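
The forest-fire step can be sketched as a partial breadth-first search in which the number
of links followed from each node is drawn from a geometric distribution with mean
$p/(1 - p)$, i.e., 0.7/0.3, or about 2.33 for $p = 0.7$. The Python sketch below is only
an illustration: the function names are hypothetical, and restarting from a new seed node
when the fire dies out is an assumption of the sketch.

```python
import random
from collections import deque


def adjacency(edges):
    """Build neighbor lists of an undirected network from its link list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def geometric(p, rng):
    """Number of successes before the first failure; the mean is p / (1 - p)."""
    count = 0
    while rng.random() < p:
        count += 1
    return count


def forest_fire_sampling(edges, n_links, p=0.7, rng=random, max_restarts=1000):
    """Partial breadth-first search that follows a geometrically distributed
    number of links to not yet visited neighbors of each burning node."""
    adj = adjacency(edges)
    sampled, burned = set(), set()
    for _ in range(max_restarts):
        if len(sampled) >= n_links:
            break
        seed = rng.choice(sorted(adj))  # (re)start from a random seed node
        burned.add(seed)
        queue = deque([seed])
        while queue and len(sampled) < n_links:
            node = queue.popleft()
            fresh = [v for v in adj[node] if v not in burned]
            rng.shuffle(fresh)
            for v in fresh[:geometric(p, rng)]:
                if len(sampled) >= n_links:
                    break
                burned.add(v)
                sampled.add((min(node, v), max(node, v)))
                queue.append(v)
    return sampled


# Hypothetical toy network with links stored as (smaller, larger) node pairs.
toy = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
sample = forest_fire_sampling(toy, 3, rng=random.Random(0))
```

Because each burned node is reached through exactly one link, the links collected from a
single fire form a tree, which is one way to see why such samples are sparse.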

Moreover, we apply the subgraph induction step to random walk and forest-fire sampling,
which we term random walk sampling with subgraph induction (RWI) and forest-fire
sampling with subgraph induction (FFI). Here, the sample consists of links sampled
with random walk (Fig. 2(b)) or forest-fire sampling (Fig. 2(d)), while any additional
links among the endpoints of sampled links are also included in the sample. To the
best of our knowledge, RWI has not been analysed in any of the previous studies. On
the other hand, FFI shows worse performance than RLI in matching the path length,
degree and clustering distributions m. Still, the performance of FFI has not yet been
compared to a larger set of sampling techniques.


3. Analysis and discussion 


In the following sections, we present the adopted framework for comparison of sampling
techniques (Section 3.1), report the results of the empirical analysis and formally
discuss the findings (Section 3.2).


3.1. Statistical comparison 

Sampling techniques are compared through four network properties (see Section 3.2)
on ten real-world networks (see Table 1). For each property, we compute externally
studentized residuals of the properties of sampled networks that measure the consistency of
each sampling technique with the rest. We expose statistically significant inconsistencies
between the techniques with a two-tailed Student t-test [16] and calculate the residuals
as introduced below (see also [HD).

Let $x_{ij}$ denote some property value of the $j$-th network for the $i$-th sampling
technique. The externally studentized residual $\hat{x}_{ij}$ is calculated as

    \hat{x}_{ij} = \frac{x_{ij} - \hat{\mu}_{ij}}{\hat{\sigma}_{ij} \sqrt{1 - 1/N}},    (1)

where $N$ is the number of sampling techniques ($N = 8$ in our case), $\hat{\mu}_{ij}$ is the
mean value and $\hat{\sigma}_{ij}$ is the standard deviation of the considered property. The mean
value and standard deviation are computed for all sampling techniques, excluding the
observed one.

Assuming that the errors in $x_{ij}$ are independent and normally distributed, the residuals
$\hat{x}_{ij}$ have Student t-distribution with $N - 2$ degrees of freedom. Statistically
significant inconsistencies between the sampling techniques are revealed by a two-tailed Student
t-test [IS] at P-value of 0.05, rejecting the null hypothesis that the values of the
considered property are consistent across the sampling techniques.
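
The residual computation can be sketched in Python as follows. This is a sketch under
stated assumptions rather than the procedure of the study: the function name and the
example values are hypothetical, the normalization by $\sqrt{1 - 1/N}$ and the use of
$N - 2$ in the leave-one-out variance are assumptions, and 2.447 is the standard
two-tailed critical value of the Student t-distribution with 6 degrees of freedom at 0.05.

```python
import math


def externally_studentized(values, i):
    """Residual of values[i] against the mean and standard deviation
    of all other values (the observed technique is left out)."""
    n = len(values)
    rest = values[:i] + values[i + 1:]
    mean = sum(rest) / (n - 1)
    # Assumption: unbiased variance of the remaining n - 1 values.
    var = sum((x - mean) ** 2 for x in rest) / (n - 2)
    return (values[i] - mean) / (math.sqrt(var) * math.sqrt(1 - 1 / n))


# Hypothetical property values of one network for N = 8 sampling techniques.
values = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 3.0]
residuals = [externally_studentized(values, i) for i in range(len(values))]

# Two-tailed test at 0.05 with N - 2 = 6 degrees of freedom.
T_CRITICAL = 2.447
inconsistent = [i for i, r in enumerate(residuals) if abs(r) > T_CRITICAL]
```

In this made-up example only the last technique deviates significantly from the rest,
since excluding it from its own mean and standard deviation keeps the outlier from
masking itself.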


3.2. Empirical analysis 

The empirical analysis is performed on ten social and information networks. Their
main characteristics are presented in Table 1. Due to the large number of networks
considered, a detailed description is omitted. All networks are treated as undirected,
although some of them are originally directed. For each network, we perform 100 realizations
of each sampling technique, where we consider sample sizes of 15% of the original networks
as suggested in [li]. For each run of the network exploration techniques, the sample
is constructed from a new randomly selected seed node.

Table 1: Real-world networks considered in the study.

Network     Description                       Nodes      Links      Average  Clustering   Density
                                                                    degree   coefficient
ca-hep      High E. Phys. collaboration [19]  12,008     237,010    39.5     0.660        3.3 × 10^-3
ca-astro    Astro Phys. collaboration [19]    18,772     396,160    42.2     0.318        2.2 × 10^-3
cit-hep     High E. Phys. citation [20]       27,240     342,437    25.1     0.120        9.2 × 10^-4
brightkite  Brightkite friendship [21]        58,228     214,078    7.4      0.111        1.3 × 10^-4
slashdot    Slashdot friendship [22]          82,168     948,464    23.1     0.024        2.8 × 10^-4
flickr      Flickr images metadata [23]       105,938    2,316,948  43.7     0.402        4.1 × 10^-4
ca-dblp     DBLP collaboration [24]           317,080    1,049,866  6.6      0.306        2.1 × 10^-5
nd.edu      Web graph of nd.edu [25]          325,729    1,497,134  9.1      0.097        2.8 × 10^-5
youtube     Youtube friendship [26]           1,134,890  2,987,624  5.2      0.006        4.5 × 10^-6
road-tx     Texas road network [22]           1,379,917  1,921,660  5.5      0.060        3.9 × 10^-6


We analyze different properties of the original and sampled networks, including the degree
distribution (probability distribution of degrees of all nodes), the distribution of the
clustering coefficient (probability distribution of the proportions of connected neighbors of
each node [E]), the average degree (average number of neighbors of nodes over the whole
network), and the density (the ratio of existing links to all possible links). We compare
sampling techniques based on the match between the original and sampled networks; Section 3.2.1
reports the results for degree and clustering distributions, while in Section 3.2.2 we
analyze the average degree and density.
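
As a worked example of the two scalar properties, the average degree and density of a
simple undirected network follow directly from its numbers of nodes and links. The short
Python sketch below uses hypothetical function names and the node and link counts of the
ca-hep network from Table 1.

```python
def average_degree(n, m):
    """Average number of neighbors in a simple undirected network."""
    return 2 * m / n


def density(n, m):
    """Ratio of existing links to all n * (n - 1) / 2 possible links."""
    return 2 * m / (n * (n - 1))


# ca-hep: 12,008 nodes and 237,010 links (see Table 1).
k_avg = average_degree(12008, 237010)  # about 39.5
rho = density(12008, 237010)           # about 3.3e-3
```

Both values agree with the corresponding row of Table 1 after rounding.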


3.2.1. Degree and clustering coefficient distributions 

We first analyze the performance of sampling techniques based on the match of the
degree and clustering coefficient distributions between the original and sampled networks.
To compare the distributions of properties, we use the Kolmogorov-Smirnov D-statistic,
which is commonly used in similar studies HHHE]. The Kolmogorov-Smirnov test checks the
null hypothesis that the distributions of a property of the original network and its sampled
variant are the same, while the D-statistic measures the distance between the observed
distributions. Based on the values of the Kolmogorov-Smirnov D-statistic, we compare the
adopted sampling techniques (see Section 3.1).
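
For illustration, the two-sample Kolmogorov-Smirnov D-statistic is the largest absolute
difference between two empirical cumulative distribution functions. The Python sketch
below uses a hypothetical function name and made-up degree sequences; it is not the code
used in the study.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical cumulative
    distribution functions of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of values <= x
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical degree sequences of an original network and its sample.
original_degrees = [1, 2, 2, 3, 3, 3, 4, 5]
sampled_degrees = [1, 1, 2, 2, 3]
d = ks_statistic(original_degrees, sampled_degrees)
```

A value of 0 means that the two empirical distributions coincide, while a value of 1
means that they do not overlap at all.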


The comparison of sampling techniques based on the degree distribution is shown
in Fig. 3. We observe that in most cases, the difference between the techniques with a
subgraph induction step (i.e., RNS, RND, RLI, RWI, FFI) and those without subgraph
induction (i.e., RLS, RWS, FFS) is clear. The first group of techniques approximates the
degree distribution of the original networks more accurately. In addition, the techniques
with a subgraph induction step improve the performance of the corresponding techniques
without subgraph induction in the case of RLS and FFS. Since subgraph induction
increases the degrees of the nodes in the sample [7], this clearly contributes to a better
match of the degree distribution between the original and sampled networks.

RNS and RND show comparable accuracy to network exploration techniques with
subgraph induction, even though they construct disconnected samples. This indicates
that the connectivity of the sample does not affect the match of the degree
distribution. On the other hand, among the techniques without subgraph induction,
RWS shows the best performance, which could be explained by its bias towards selecting
high degree nodes and exploring densely connected parts of the network [3]. In contrast,
the samples constructed by FFS are sparse: a large fraction of the nodes in the samples
has low degree, while the number of nodes with higher degree is underestimated [7].
Accordingly, FFS is the least accurate among all techniques.

Moreover, we compare the sampling techniques based on the clustering coefficient
distribution. The results are presented in Fig. 4. The techniques show more comparable
accuracy than in the case of the degree distribution. In general, the techniques with
subgraph induction underestimate the clustering distribution, while others overestimate
it. FFS and RWS prove to be the best performing techniques. This indicates that in the case
of matching the clustering distribution of the original networks, including all the links
among sampled nodes is not the best choice. However, RNS shows the worst performance,
which could be explained by its tendency to construct samples with a large number of
low clustered nodes [7].

In summary, the sampling techniques with a subgraph induction step underestimate
the degree and clustering coefficient distributions of the original networks. On the other
hand, the techniques without subgraph induction overestimate both properties. The
best overall performance is provided by RWI, which shows the most stable accuracy for
matching both distributions on all considered networks. However, it appears that the
properties of the original networks have a relevant effect on the accuracy of the sampling
techniques. For example, the techniques with subgraph induction match the degree
distribution of the networks with larger average degree more accurately than that of the
networks with lower average degree, where the techniques without subgraph induction also
perform well. We aim to study such relations in future work, since they should be observed
in an even larger set of real-world networks to obtain statistically significant results.


3.2.2. Average degree and density 

In the second part of the analysis, we study the performance of sampling techniques
based on the match of the average degree and density between the original and sampled
networks. We compare the properties between networks based on the actual values of
properties. Furthermore, we modify Eq. 0 and 0 and use the true value of the considered
property of the original networks instead of the mean values $\hat{\mu}_{ij}$ (see Section 3.1) to
compare the sampling techniques.

Fig. 5 shows the comparison of sampling techniques based on the average degree. We
observe that the techniques with subgraph induction tend to overestimate the average
degree, while others underestimate it. Among the techniques with an induction step, only
RNS underestimates the average degree. The samples constructed with RNS consist of a
large fraction of low-degree nodes [7] and have a low average degree particularly in the
smaller samples [S]. Therefore, RNS underestimates the average degree of the networks,
especially compared to the techniques that are biased towards selecting nodes with higher
degree (i.e., RND).

Lastly, we analyze the density of the networks. Fig. 6 shows the comparison of the
sampling techniques based on the density. In general, all techniques overestimate the
density, yet the techniques without a subgraph induction step approximate the density
more accurately than the techniques with subgraph induction. Additionally, techniques
without subgraph induction improve the performance of the corresponding techniques
with subgraph induction. In a previous study [27], we showed the power-law relationship
between the size and density of real-world networks and their sampled variants.
The results show that the network density decreases with its size. However, the techniques
with subgraph induction do follow the latter relationship, while the techniques
without induction construct sparser samples than expected. Therefore, the accuracy of
the sampling techniques based on the density is relative and depends on whether the
samples should accurately match the density of the original networks or the relationship
between the size and density of the original networks and their sampled variants should
be preserved.

In addition to the above, we analyze the performance of the techniques with partial
subgraph induction, where we include a different portion of mutual links between the
sampled nodes in the sample (i.e., we randomly select 10-90% of links from all possible
links). The results are not described, since the techniques with partial subgraph induction
did not improve the performance of the techniques with subgraph induction and they
did not perform worse than techniques without subgraph induction.

In summary, the sampling techniques with a subgraph induction step tend to overestimate
the average degree and the density of the original networks. In contrast, the
techniques without subgraph induction underestimate the average degree and approximate
the density of the original networks more accurately than other techniques. Yet,
the accuracy of sampling techniques does not depend only on the characteristics of the
adopted technique, but also on the characteristics of the original networks. Among the
network exploration techniques, the best performance is provided by RWI, while
the best overall performance is provided by RNS, which shows very stable accuracy in
matching the considered properties.

4. Conclusion 

In this paper, we analyze different real-world networks and study the changes of their
properties introduced by network sampling. We consider eight sampling techniques with
and without a subgraph induction step and compare them based on the match of properties
between the original networks and their sampled variants.

The results reveal that the techniques with subgraph induction underestimate the degree
and clustering distributions, while they overestimate the average degree and density
of the original networks. On the other hand, the techniques without a subgraph induction
step exhibit the opposite behavior, since they overestimate the degree and clustering
distributions and underestimate the average degree and density. The techniques without
subgraph induction approximate the density most accurately. From the random
selection category, random node selection shows the best overall performance, while
from the network exploration category, the most stable accuracy is provided by random
walk sampling with subgraph induction. Still, it appears that the performance of the
sampling techniques from the random selection category does not differ significantly
from that of network exploration sampling, while clear differences exist between the
techniques with a subgraph induction step and the ones without subgraph induction.
Following this, a suitable classification of network sampling techniques would be a class
of techniques with subgraph induction and a class without it.

Some of the results suggest that the characteristics of the original networks affect the
accuracy of network sampling. Thus, our future work will mainly focus on analyzing
relations between network properties and the accuracy of the sampling techniques on a
larger set of real-world networks. Besides, a prominent direction for further study is the
analysis of the time and space efficiency of sampling techniques, since, for example, fitting
even a sampled network in the main memory becomes challenging with the significant growth
of real-world networks in the past few years.

[Panels (a) and (b): Degree distribution; axis: Externally studentized residuals]

Figure 3: Statistical comparison of the degree distribution of the original networks and their
sampled variants obtained by different sampling techniques. We show externally studentized residuals
that measure the consistency of each sampling technique with the rest and expose statistically
significant inconsistencies between the techniques with a two-tailed Student t-test at P-value of 0.05
(see (a); the shaded region corresponds to 95% confidence intervals). Merely for complete
understanding, we present average externally studentized residuals over all networks (see (b); the star
markers indicate the residuals for the original networks, while error bars display standard
deviations). Notice that techniques with a subgraph induction step (full markers) approximate the
degree distribution more accurately than the techniques without subgraph induction (empty markers).








[Panels (a) and (b): Clustering coefficient distribution; axis: Externally studentized residuals]

Figure 4: Statistical comparison of the clustering coefficient distribution of the original networks and
their sampled variants obtained by different sampling techniques. We show externally studentized
residuals that measure the consistency of each sampling technique with the rest and expose statistically
significant inconsistencies between the techniques with a two-tailed Student t-test at P-value of 0.05
(see (a); the shaded region corresponds to 95% confidence intervals). Merely for complete
understanding, we present average externally studentized residuals over all networks (see (b); the star
markers indicate the residuals for the original networks, while error bars display standard
deviations). Notice that techniques with a subgraph induction step (full markers) improve the
performance of the corresponding techniques without subgraph induction (empty markers).







[Panels (a) and (b): Average degree; axis: Externally studentized residuals]

Figure 5: Statistical comparison of the average degree of the original networks and their sampled
variants obtained by different sampling techniques. We show externally studentized residuals that
measure the consistency of each sampling technique with the rest and expose statistically significant
inconsistencies between the techniques with a two-tailed Student t-test at P-value of 0.05 (see (a); the
shaded region corresponds to 95% confidence intervals). Merely for complete understanding, we present
average externally studentized residuals over all networks (see (b); the star markers indicate the
residuals for the original networks, while error bars display standard deviations). Notice that
techniques with a subgraph induction step (full markers) tend to overestimate the average degree, while
the techniques without subgraph induction (empty markers) underestimate it.







[Panel (a): Density]

Figure 6: Statistical comparison of the density of the original networks and their sampled variants
obtained by different sampling techniques. We show externally studentized residuals that measure the
consistency of each sampling technique with the rest and expose statistically significant
inconsistencies between the techniques with a two-tailed Student t-test at P-value of 0.05 (see (a); the
shaded region corresponds to 95% confidence intervals). Merely for complete understanding, we present
average externally studentized residuals over all networks (see (b); the star markers indicate the
residuals for the original networks, while error bars display standard deviations). Notice that the
techniques without a subgraph induction step (empty markers) approximate the density more accurately
than the corresponding techniques with subgraph induction (full markers).






